Abstract: Let $P = \{p(i)\}$ be a measure of strictly positive probabilities on the set of nonnegative integers. Although the countable number of inputs prevents usage of the Huffman algorithm, there are nontrivial $P$ for which known methods find a source code that is optimal in the sense of minimizing expected codeword length. For some applications, however, a source code should instead minimize one of a family of nonlinear objective functions, $\beta$-exponential means, those of the form $\log_a \sum_i p(i) a^{n(i)}$, where $n(i)$ is the length of the $i$th codeword and $a$ is a positive constant. Applications of such minimizations include a problem of maximizing the chance of message receipt in single-shot communications ($a < 1$) and a problem of minimizing the chance of buffer overflow in a queueing system ($a > 1$). This paper introduces methods for finding codes optimal for such exponential means. One method applies to geometric distributions, while another applies to distributions with lighter tails. The latter algorithm is applied to Poisson distributions. Both are extended to minimizing maximum pointwise redundancy.

I. Introduction, Motivation, and Main Results 

If probabilities are known, optimal lossless source coding 
of individual symbols (and blocks of symbols) is usually 
done using David Huffman's famous algorithm [1]. There 
are, however, cases that this algorithm does not solve [2]. 
Problems with an infinite number of possible inputs — e.g., 
geometrically-distributed variables — are not covered. Also, 
in some instances, the optimality criterion — or penalty — 
is not the linear penalty of expected length. Both variants of 
the problem have been considered in the literature, but not 
simultaneously. This paper discusses cases which are both 
infinite and nonlinear. 

An infinite-alphabet source emits symbols drawn from the alphabet $X_\infty = \{0, 1, 2, \ldots\}$. (More generally, we use $X$ to denote an input alphabet whether infinite or finite.) Let $P = \{p(i)\}$ be the sequence of probabilities for each symbol, so that the probability of symbol $i$ is $p(i) > 0$. The source symbols are coded into binary codewords. The codeword $c(i) \in \{0, 1\}^*$ in code $C$, corresponding to input symbol $i$, has length $n(i)$, thus defining length distribution $N$.

Perhaps the most well-known such codes are the optimal 
codes derived by Golomb for geometric distributions [3], [4]. 
There are many reasons for using infinite-alphabet codes rather 
than codes for finite alphabets, such as Huffman codes. The 
most obvious use is for cases with no upper bound — or at 
least no known upper bound — on the number of possible 
items. In addition, for many cases it is far easier to come up
with a general code for integers rather than a Huffman code
for a large but finite number of inputs. Similarly, it is often 
faster to encode and decode such well-structured codes. For 
these reasons, infinite-alphabet codes and variants of them are 
widely used in image and video compression standards [5], [6], 
as well as for compressing text, audio, and numerical data. 

To date, the literature on infinite-alphabet codes has considered only finding efficient uniquely decipherable codes with respect to minimizing expected codeword length $\sum_i p(i) n(i)$. Other utility functions, however, have been considered for finite-alphabet codes. Campbell [7] introduced a problem in which the penalty to minimize, given some continuous (strictly) monotonic increasing cost function $\varphi(x) : \mathbb{R}_+ \to \mathbb{R}_+$, is

\[ L(P, N, \varphi) = \varphi^{-1}\left( \sum_i p(i) \varphi(n(i)) \right) \]

and specifically considered the exponential subcases with exponent $a > 1$:

\[ L_a(P, N) \triangleq \log_a \sum_i p(i) a^{n(i)} \tag{1} \]

that is, $\varphi(x) = a^x$. Note that minimizing penalty $L$ is also an interesting problem for $0 < a < 1$ and approaches the standard penalty $\sum_i p(i) n(i)$ for $a \to 1$ [7]. While $\varphi(x)$ decreases for $a < 1$, one can map decreasing $\varphi$ to a corresponding increasing function $\tilde{\varphi}(l) = \varphi_{\max} - \varphi(l)$ (e.g., for $\varphi_{\max} = 1$) without changing the penalty value. Thus this problem, equivalent to maximizing $\sum_i p(i) a^{n(i)}$, is a subset of those considered by Campbell. All penalties of the form (1) are called $\beta$-exponential means, where $\beta = \log_2 a$ [8, p. 158].

Campbell noted certain properties for /3-exponential means, 
but did not consider applications for these means. Applications 
were later found for the problem with a > 1 [9]–[11]. These
applications relate to a problem in which we wish to minimize 
the probability of buffer overflow in communications; this is 
discussed in the full version of this paper [12]. Also discussed 
in the full version is an application for a < 1 introduced in 
[13], a problem of maximizing the chance of message receipt 
in single-shot communications. 

One can solve any instance of the exponential penalty with a finite number of inputs using a linear-time algorithm found independently by Hu et al. [14, p. 254], Parker [15, p. 485], and Humblet [16, p. 25], [10, p. 231], although only the last of these considered $a < 1$. We present the exponential-penalty algorithm here; even though it cannot be used for an infinite alphabet, it can be used to derive and show the optimality of infinite-alphabet codes:

Procedure for Exponential Huffman Coding
This procedure minimizes (1) for any positive $a \ne 1$ and $|X| < \infty$, even if the "probabilities" do not add to 1. We refer to such arbitrary positive inputs as weights, denoted by $w(i)$ instead of $p(i)$:

1) Each item $i$ has weight $w(i) \in W_X$, where $X$ is the (finite) alphabet and $W_X = \{w(i)\}$ is the set of all such weights. Assume each item $i$ has codeword $c(i)$, to be determined later.

2) Combine the items with the two smallest weights $w(j)$ and $w(k)$ into one item with the combined weight $\tilde{w}(j) = a \cdot (w(j) + w(k))$. This item has codeword $\tilde{c}(j)$, to be determined later, while item $j$ is assigned codeword $c(j) = \tilde{c}(j)0$ and $k$ codeword $c(k) = \tilde{c}(j)1$. Since these have been assigned in terms of $\tilde{c}(j)$, replace $w(j)$ and $w(k)$ with $\tilde{w}(j)$ in $W_X$ to form $W_{\tilde{X}}$.

3) Repeat procedure, now with the remaining codewords (reduced in number by 1) and corresponding weights, until only one item is left. The weight of this item is $\sum_i w(i) a^{n(i)}$. All codewords are now defined by assigning the null string to this trivial item.
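
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the linear-time variant: the names `exp_huffman_lengths` and `exp_penalty` are hypothetical, the priority queue is an implementation choice, and the tie-breaking rule is an assumption, since the procedure does not specify one. Only codeword lengths are tracked, since these determine the penalty.

```python
import heapq
import math
from itertools import count


def exp_huffman_lengths(weights, a):
    """Return codeword lengths n(i) from the exponential Huffman procedure.

    weights: positive weights w(i); a: positive penalty base.
    Uses a binary heap, so this sketch is not the linear-time variant.
    """
    lengths = [0] * len(weights)
    fresh = count(1)  # unique positive tie-break keys for merged items
    # Leaves get keys 0, -1, -2, ... so tuples never compare the item lists.
    heap = [(w, -i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w_j, _, items_j = heapq.heappop(heap)  # smallest weight
        w_k, _, items_k = heapq.heappop(heap)  # second smallest weight
        merged = items_j + items_k
        for i in merged:  # every leaf below the new node gains one bit
            lengths[i] += 1
        # Step 2: the combined weight is a * (w(j) + w(k)).
        heapq.heappush(heap, (a * (w_j + w_k), next(fresh), merged))
    return lengths


def exp_penalty(weights, lengths, a):
    """L_a = log_a sum_i w(i) a^n(i), for a != 1."""
    return math.log(sum(w * a ** n for w, n in zip(weights, lengths)), a)


print(exp_huffman_lengths([0.6, 0.2, 0.1, 0.1], 2.0))  # [1, 2, 3, 3]
print(exp_huffman_lengths([0.6, 0.2, 0.1, 0.1], 4.0))  # [2, 2, 2, 2]
```

For these weights, raising $a$ from 2 to 4 flattens the code tree, since long codewords are penalized more heavily.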

Optimality of the algorithm is justified as in Huffman 
coding, in that an exchange argument can be used to show 
that an optimal code exists for which the least likely two 
codewords differ in only their final bit, allowing a reduction 
to the equivalent smaller problem that linearly combines their 
weights. This algorithm can be modified to run in linear time 
(to input size) given sorted weights in the same manner as 
Huffman coding [17]. 

Note that this algorithm assigns an explicit weight to each 
node of the resulting code tree implied by having each 
item represented by a node with its parent representing the 
combined items: If a node is a leaf, its weight is given by 
the associated probability; otherwise its weight is defined 
recursively as a times the sum of its children. This concept is 
useful in visualizing both the coding procedure and its output. 

It is also worthwhile to note that $a \le 0.5$ is degenerate, always resulting in the unary code (for infinite inputs) or a unary-like code (for finite inputs) being optimal for any probability distribution. The unary code has ones terminated by a zero, i.e., codewords of the form $\{1^i 0 : i \ge 0\}$. The unary-like code is a truncated unary code, that is, a code with identical codewords to the unary code except for the longest codeword, which is of the form $1^{|X|-1}$. For the unary-like code, optimality for $a \le 0.5$ can be shown using the coding procedure; the smallest two items, $j$ and $k$, are combined, and the resulting item has weight $a \cdot (w(j) + w(k))$. This is no larger than the larger of the constituent weights, meaning that the resulting item will be combined with the third-smallest item, and so forth, resulting in a unary-like code. Taking limits, informally speaking, results in a unary limit code; formally, this is a straightforward corollary of Theorem 2 in Section III.

If $a > 0.5$, a code with finite penalty exists if and only if Rényi entropy of order $\alpha(a) = (1 + \log_2 a)^{-1}$ is finite [18]. It was Campbell who first noted the connection between the optimal code's penalty, $L_a(P, N^*)$, and Rényi entropy

\[ H_\alpha(P) \triangleq \frac{1}{1-\alpha} \log_2 \sum_{i \in X} p(i)^\alpha \quad\Rightarrow\quad H_{\alpha(a)}(P) = \frac{1 + \log_2 a}{\log_2 a} \log_2 \sum_{i \in X} p(i)^{(1+\log_2 a)^{-1}}. \]

This relationship is

\[ H_{\alpha(a)}(P) \le L_a(P, N^*) < H_{\alpha(a)}(P) + 1 \]

which should not be surprising given the similar relationship between Huffman coding and Shannon entropy [19], which corresponds to $a \to 1$, $H_1(P)$ [20].

One must be careful regarding the meaning of an "optimal 
code" when there are an infinite number of possible codes 
under consideration. One might ask whether there must exist 
an optimal code or if there can be an infinite sequence of 
codes of decreasing penalty without any code achieving the 
limit penalty value. Fortunately the answer is the former, the 
proof being a special case of Theorem 2 in [18]. The question 
is then how to find one of these optimal source codes given 
parameter a and probability measure P. 

As in the linear case, this is not known for general $P$, but can be found for certain common distributions. In the next section, we consider geometric distributions and find that Golomb codes are optimal, although the optimal Golomb code for a given probability mass function varies according to $a$. The main result of Section II is that, for $p_\theta(i) = (1-\theta)\theta^i$ and $a \in \mathbb{R}_+$, $G_k$, the Golomb code with parameter $k$, is optimal for

\[ k = \max\left(1, \lceil -\log_\theta a - \log_\theta(1+\theta) \rceil\right). \]

In Section III we consider distributions that are relatively light-tailed, that is, that decline faster than certain geometric distributions. If there is a nonnegative integer $r$ such that for all $j > r$ and $i < j$,

\[ p(i) > \max\left( p(j), \sum_{k=j+1}^{\infty} p(k) a^{k-j} \right) \]

then an optimal binary prefix code can be found which is a generalization of the unary code. A specific case of this is the Poisson distribution, where an aforementioned $r$ is given by $r = \max(\lceil 2a\lambda \rceil - 2, \lceil e\lambda \rceil - 1)$ for $p_\lambda(i) = \lambda^i e^{-\lambda}/i!$. Section IV discusses the maximum pointwise redundancy penalty, which has a similar solution with light-tailed distributions and for which the Golomb code $G_k$ with $k = \lceil -1/\log_2 \theta \rceil$ is optimal for geometric distributions. Complete proofs and illustrations, as well as additional results, are given in the full version [12].



II. Geometric Distribution with Exponential Penalty

Consider the geometric distribution

\[ p_\theta(i) = (1-\theta)\theta^i \]

for parameter $\theta \in (0, 1)$. This distribution arises in run-length coding as well as in other circumstances [3], [4].

For the traditional linear penalty, a Golomb code with parameter $k$ (or $G_k$) is optimal for

\[ \theta^k + \theta^{k+1} \le 1 < \theta^{k-1} + \theta^k. \]

Such a code consists of a unary prefix followed by a binary suffix, the latter taking one of $k$ possible values. If $k$ is a power of two, all binary suffix possibilities have the same length; otherwise, their lengths $\sigma(i)$ differ by at most 1 and $\sum_i 2^{-\sigma(i)} = 1$. Such binary codes are called complete codes. This defines the Golomb code; for example, the Golomb code for $k = 3$ assigns to $j = 0, 1, 2, \ldots$ the codewords

0 0, 0 10, 0 11, 10 0, 10 10, 10 11, 110 0, ...

where the space in the code separates the unary prefix from the complete suffix. In general, codeword $j$ for $G_k$ is of the form $\{1^{\lfloor j/k \rfloor} 0\, b(j \bmod k, k) : j \ge 0\}$, where $b(j \bmod k, k)$ is a complete binary code for the $(j - k\lfloor j/k \rfloor + 1)$th of $k$ items.
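
The codeword form just described can be made concrete with a short Python sketch. The function names are hypothetical, and the particular complete suffix code used here (shorter suffixes assigned to the smaller remainders, in lexicographic order) is one standard choice assumed for illustration; any complete code with the same lengths gives the same penalty.

```python
def complete_suffix(s, k):
    """Complete binary code b(s, k) for the s-th (0-based) of k items."""
    if k == 1:
        return ""  # a single item needs no suffix bits
    g = k.bit_length()  # equals floor(log2 k) + 1
    z = (1 << g) - k    # number of suffixes that get only g - 1 bits
    if s < z:
        return format(s, f"0{g - 1}b")
    return format(s + z, f"0{g}b")


def golomb_codeword(j, k):
    """Codeword j of G_k: unary prefix 1^(j // k) 0, then the suffix."""
    return "1" * (j // k) + "0" + complete_suffix(j % k, k)


# First codewords of the Golomb code with k = 3.
print([golomb_codeword(j, 3) for j in range(7)])
```

With $k = 1$ the suffix is empty and the sketch reduces to the unary code.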

It turns out that such codes are optimal for the exponential 
penalty: 

Theorem 1: For $a \in \mathbb{R}_+$, if

\[ \theta^k + \theta^{k+1} \le \frac{1}{a} < \theta^{k-1} + \theta^k \tag{2} \]

for $k \ge 1$, then the Golomb code $G_k$ is the optimal code for $P_\theta$. If no such $k$ exists, the unary code is optimal.

The proof of optimality (in full version [12]) uses the procedure for exponential Huffman coding to find an optimal exponential Huffman code for a sequence of similar finite weight distributions. Define an $m$-reduced geometric source $W_m$ as:

\[ w_m(i) \triangleq \begin{cases} (1-\theta)\theta^i, & 0 \le i \le m \\ \dfrac{(1-\theta)\theta^i}{1 - a\theta^k}, & m < i \le m + k \end{cases} \]

for any $m \ge -1$. It can be shown that this distribution has an optimal code with lengths $n(0)$ through $n(m)$ that are identical to the Golomb code in question. One can then show that the optimal code for the geometric distribution must have a penalty between that for the Golomb code for the geometric distribution and the optimal code for $W_m$ (for any $m$). Since the latter two penalties approach equality as $m \to \infty$, the Golomb code must be optimal.

This rule for finding an optimal Golomb code $G_k$ is equivalent to

\[ k = \max\left(1, \lceil -\log_\theta a - \log_\theta(1+\theta) \rceil\right). \]

This is a generalization of the traditional linear result since this corresponds to $a \to 1$. Cases in which the left inequality of (2) is an equality have multiple solutions, as with linear coding; see, e.g., [21, p. 289].
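
As a quick numerical illustration of this rule, the following Python sketch computes $k$ from $\theta$ and $a$ and checks it against the two-sided inequality $\theta^k + \theta^{k+1} \le 1/a < \theta^{k-1} + \theta^k$. The function names are hypothetical, and floating-point arithmetic is assumed to be adequate away from the boundary cases where the left inequality holds with equality.

```python
import math


def optimal_golomb_k(theta, a):
    """k = max(1, ceil(-log_theta(a) - log_theta(1 + theta)))."""
    log_theta = math.log(theta)  # negative for 0 < theta < 1
    x = -(math.log(a) + math.log1p(theta)) / log_theta
    return max(1, math.ceil(x))


def satisfies_condition(theta, a, k):
    """Check theta^k + theta^(k+1) <= 1/a < theta^(k-1) + theta^k."""
    return theta**k + theta**(k + 1) <= 1.0 / a < theta**(k - 1) + theta**k


theta, a = 0.9, 1.05
k = optimal_golomb_k(theta, a)
print(k, satisfies_condition(theta, a, k))
```

When the ceiling is zero or negative, no $k \ge 1$ satisfies the inequality and the `max` returns 1, that is, the unary code.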

It is equivalent for the bits of the unary prefix to be reversed, that is, to use $\{0^{\lfloor j/k \rfloor} 1\, b(j \bmod k, k) : j \ge 0\}$ (as in [4]) instead of $\{1^{\lfloor j/k \rfloor} 0\, b(j \bmod k, k) : j \ge 0\}$ (as in [3]). The latter has the advantage of being alphabetic, that is, $i > j$ if and only if $c(i)$ is lexicographically after $c(j)$.

A little algebra reveals that, for a distribution $P_\theta$ and a Golomb code with parameter $k$ (lengths $N_k$),

\[ L_a(P_\theta, N_k) = \log_a \sum_{i=0}^{\infty} (1-\theta)\theta^i a^{n_k(i)} = g + \log_a\left( 1 + \frac{(a-1)\theta^z}{1 - a\theta^k} \right) \tag{3} \]

where $g = \lfloor \log_2 k \rfloor + 1$ and $z = 2^g - k$. Therefore, Theorem 1 provides the $k$ that minimizes (3). If $a > 0.5$, the corresponding Rényi entropy is

\[ H_{\alpha(a)}(P_\theta) = \log_a \frac{1-\theta}{\left(1 - \theta^{\alpha(a)}\right)^{1/\alpha(a)}} \tag{4} \]

where we recall that $\alpha(a) = (1 + \log_2 a)^{-1}$. (Again, $a \le 0.5$ is degenerate, an optimal code being unary with no corresponding Rényi entropy.)
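
The closed forms can be checked numerically. The Python sketch below compares the closed-form penalty $g + \log_a(1 + (a-1)\theta^z/(1 - a\theta^k))$ with a truncated direct summation over Golomb codeword lengths, and evaluates the Rényi entropy of the geometric source. Function names, the truncation at 5000 terms, and the test values $\theta = 0.9$, $a = 1.05$ are assumptions chosen for illustration; the closed form requires $a\theta^k < 1$.

```python
import math


def golomb_length(j, k):
    """Length of codeword j in G_k: unary prefix plus complete suffix."""
    g = k.bit_length()  # floor(log2 k) + 1
    z = (1 << g) - k    # remainders below z get a (g - 1)-bit suffix
    suffix = g - 1 if (j % k) < z else g
    return j // k + 1 + suffix


def penalty_closed_form(theta, a, k):
    """g + log_a(1 + (a - 1) theta^z / (1 - a theta^k)); needs a theta^k < 1."""
    g = k.bit_length()
    z = (1 << g) - k
    return g + math.log(1 + (a - 1) * theta**z / (1 - a * theta**k), a)


def penalty_by_summation(theta, a, k, terms=5000):
    """log_a of the truncated sum of (1 - theta) theta^j a^n(j)."""
    total = sum((1 - theta) * theta**j * a ** golomb_length(j, k)
                for j in range(terms))
    return math.log(total, a)


def renyi_entropy_geometric(theta, a):
    """log_a((1 - theta) / (1 - theta^alpha)^(1/alpha)), alpha = 1/(1 + log2 a)."""
    alpha = 1 / (1 + math.log2(a))
    return math.log((1 - theta) / (1 - theta**alpha) ** (1 / alpha), a)


theta, a, k = 0.9, 1.05, 7
closed = penalty_closed_form(theta, a, k)
print(closed, penalty_by_summation(theta, a, k))
print(closed - renyi_entropy_geometric(theta, a))  # redundancy of G_7
```

For these values the two penalty computations agree and the resulting redundancy lies between 0 and 1, consistent with the entropy bounds stated in Section I.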

In evaluating the effectiveness of the optimal code, one might use the following definition of average pointwise redundancy (or just redundancy):

\[ R_a(N, P) \triangleq L_a(P, N) - H_{\alpha(a)}(P). \]

For nondegenerate values, we can plot the $R_a(N^*_{\theta,a}, P_\theta)$ obtained from the minimization. This is done for $a > 1$ and $a < 1$ in Fig. 1. As $a \to 1$, the plot approaches the redundancy plot for the linear case, e.g., [4], reproduced as Fig. 2. In many potential applications of nonlinear coding, such as the aforementioned for $a > 1$ [9]–[11] and $a < 1$ [12], [13], $a$ is very close to 1. Since this analysis shows that the Golomb code that is optimal for given $a$ and $\theta$ is optimal not only for these particular values, but for a range of $a$ (fixing $\theta$) and a range of $\theta$ (fixing $a$), the Golomb code is, in some sense, much more robust and general than previously appreciated.

III. Other Infinite Sources 

In this section we consider another type of probability distribution for binary coding, a type with a light tail. Humblet's approach [22], later extended in [23], uses the fact that there is always an optimal code consisting of a finite number of nonunary codewords for any probability distribution with a relatively light tail, one for which there is an $r$ such that, for all $j > r$ and $i < j$, $p(i) > p(j)$ and $p(i) > \sum_{k=j+1}^{\infty} p(k)$. Due to the additive nature of Huffman coding, the unary part can be considered separately, and the remaining codewords can be found via the Huffman algorithm. Once again, this has to be modified for the exponential case.

Fig. 1. Redundancy of the optimal code for the geometric distribution with the exponential penalty (parameter $a$). $R_a(N^*_{\theta,a}, P_\theta) = L_a(P_\theta, N^*_{\theta,a}) - H_{\alpha(a)}(P_\theta)$, where $\alpha(a) = (1 + \log_2 a)^{-1}$, $P_\theta$ is the geometric probability sequence implied by $\theta$, and $N^*_{\theta,a}$ is the optimal length sequence for distribution $P_\theta$ and parameter $a$.

Fig. 2. Redundancy of the optimal code for the geometric distribution with the traditional linear penalty.

We wish to show that the optimal code can be obtained when there is a nonnegative integer $r$ such that, for all $j > r$ and $i < j$,

\[ p(i) > \max\left( p(j), \sum_{k=j+1}^{\infty} p(k) a^{k-j} \right). \]

The optimal code is obtained by considering the reduced alphabet consisting of symbols $0, 1, \ldots, r+1$ with weights

\[ w_{-1}(i) \triangleq \begin{cases} p(i), & i \le r \\ \sum_{k=r+1}^{\infty} p(k) a^{k-r}, & i = r+1. \end{cases} \tag{5} \]

Apply exponential Huffman coding to this reduced set of weights. For items 0 through $r$, the Huffman codewords for the reduced and the infinite alphabets are identical. Each other item $i > r$ has a codeword consisting of the reduced codeword for item $r+1$ (which, without loss of generality, consists of all 1's) followed by the unary code for $i - r - 1$. We call such codes unary-ended.

Theorem 2: Let $p(\cdot)$ be a probability measure on the set of nonnegative integers, and let $a$ be the parameter of the penalty to be optimized. If there is a nonnegative integer $r$ such that for all $j > r$ and $i < j$,

\[ p(i) > \max\left( p(j), \sum_{k=j+1}^{\infty} p(k) a^{k-j} \right) \tag{6} \]

then there exists a minimum-penalty binary prefix code with every codeword $j > r$ consisting of $j - x$ 1's followed by one 0 for some fixed nonnegative integer $x$.

The proof of optimality (in full version [12]) is similar to that for the geometric distribution. In this case, for a given $m \ge -1$, the corresponding codeword weights are

\[ w_m(i) \triangleq \begin{cases} p(i), & i < i_{\max} \\ \sum_{k=i}^{\infty} p(k) a^{k-i+1}, & i = i_{\max} \end{cases} \]

where $i_{\max} = r + m + 2$. For $a < 1$, the proof is outlined similarly to that for the geometric case. For $a > 1$, the key is to note that the combined weight of a node in an optimal code is upper-bounded by the weight of a node with the same children in a code for which the node is the root of a unary subtree. This allows an inductive proof that the unary subtree, and thus the proposed code, is optimal.

Consider the example of optimal codes for the Poisson distribution,

\[ p_\lambda(i) = \frac{\lambda^i e^{-\lambda}}{i!}. \]

How does one find a suitable value for $r$ in such a case? It has been shown that $r \ge \lceil e\lambda \rceil - 1$ yields $p(i) > p(j)$ for all $j > r$ and $i < j$, satisfying the first condition of Theorem 2 [22]. Moreover, if, in addition, $j \ge \lceil 2a\lambda \rceil - 1$ (and thus $j > a\lambda - 1$), then

\[ \sum_{k=1}^{\infty} p(j+k) a^k = p(j) \sum_{k=1}^{\infty} \frac{a^k \lambda^k j!}{(j+k)!} < p(j) \sum_{k=1}^{\infty} \left( \frac{a\lambda}{j+1} \right)^k = p(j) \frac{a\lambda}{j + 1 - a\lambda} \le p(j) < p(i). \]

Thus, since we consider $j > r$, $r = \max(\lceil 2a\lambda \rceil - 2, \lceil e\lambda \rceil - 1)$ is sufficient to establish an $r$ such that the above method yields the optimal infinite-alphabet code.

In order to find the optimal reduced code, use

\[ w_{-1}(r+1) = \sum_{k=r+1}^{\infty} p(k) a^{k-r} = a^{-r} e^{\lambda(a-1)} - \sum_{k=0}^{r} p(k) a^{k-r}. \]

For example, consider the Poisson distribution with $\lambda = 1$. We code this for both $a = 1$ and $a = 2$. For both values, $r = 2$, so both are easy to code. For $a = 1$, $w_{-1}(3) = 1 - 2.5e^{-1} \approx 0.0803\ldots$, while, for $a = 2$, $w_{-1}(3) = 0.25e - 1.25e^{-1} \approx 0.2197\ldots$. After using the appropriate Huffman procedure on each reduced source of 4 weights, we find that the optimal code for $a = 1$ has lengths $N = \{1, 2, 3, 4, 5, 6, \ldots\}$ (those of the unary code), while the optimal code for $a = 2$ has lengths $N = \{2, 2, 2, 3, 4, 5, \ldots\}$.
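
The Poisson example can be reproduced with a short Python sketch. It assumes the tail weight $\sum_{k>r} p(k) a^{k-r} = a^{-r} e^{\lambda(a-1)} - \sum_{k \le r} p(k) a^{k-r}$, which matches the two numerical values quoted above, and it applies the exponential Huffman merge rule $a \cdot (w(j) + w(k))$ to the $r + 2$ reduced weights. All function names are hypothetical, and the tie-breaking rule (among equal weights, the higher-indexed symbol is merged first) is an assumption made so that equally likely symbols receive nondecreasing lengths.

```python
import heapq
import math
from itertools import count


def exp_huffman_lengths(weights, a):
    """Codeword lengths from repeatedly merging the two smallest weights."""
    lengths = [0] * len(weights)
    fresh = count(1)  # unique keys for merged nodes
    heap = [(w, -i, [i]) for i, w in enumerate(weights)]  # ties: larger i first
    heapq.heapify(heap)
    while len(heap) > 1:
        w_j, _, items_j = heapq.heappop(heap)
        w_k, _, items_k = heapq.heappop(heap)
        merged = items_j + items_k
        for i in merged:
            lengths[i] += 1
        heapq.heappush(heap, (a * (w_j + w_k), next(fresh), merged))
    return lengths


def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)


def poisson_code_lengths(lam, a, n_symbols):
    """Lengths n(0), ..., n(n_symbols - 1) of the unary-ended code."""
    r = max(math.ceil(2 * a * lam) - 2, math.ceil(math.e * lam) - 1)
    head = [poisson_pmf(lam, k) for k in range(r + 1)]
    # Weight of the aggregated tail item r + 1.
    tail = a ** (-r) * math.exp(lam * (a - 1)) - sum(
        p * a ** (k - r) for k, p in enumerate(head))
    reduced = exp_huffman_lengths(head + [tail], a)
    # Items beyond r append the unary code for i - r - 1 to codeword r + 1.
    return [reduced[i] if i <= r else reduced[r + 1] + (i - r)
            for i in range(n_symbols)]


print(poisson_code_lengths(1.0, 1, 6))  # [1, 2, 3, 4, 5, 6]
print(poisson_code_lengths(1.0, 2, 6))  # [2, 2, 2, 3, 4, 5]
```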

IV. Redundancy Penalties

It is natural to ask whether the above results can be extended to other penalties. One penalty discussed in the literature is that of maximal pointwise redundancy [24], in which one seeks to find a code to minimize

\[ R^*(N, P) \triangleq \max_i \left[ n(i) + \log_2 p(i) \right]. \]

This can be shown to be a limit of the exponential case, as in [25], allowing us to analyze it using the same techniques as exponential Huffman coding. This limit can be shown by defining $d$th exponential redundancy as follows:

\[ R_d(N, P) \triangleq \frac{1}{d} \log_2 \sum_{i \in X} p(i) 2^{d(n(i) + \log_2 p(i))} = \frac{1}{d} \log_2 \sum_{i \in X} p(i)^{1+d} 2^{d n(i)}. \]

Thus $R^*(N, P) = \lim_{d \to \infty} R_d(N, P)$, and the above methods should apply in the limit. In particular, the Golomb code $G_k$ for $k = \lceil -1/\log_2 \theta \rceil$ is optimal for minimizing maximum pointwise redundancy for $P_\theta$. For light tails, a similar condition to (6) holds; in this case, we find an $r$ such that, for all $i < r$, $p(i) > p(r)$ and, for all $j > r$, $p(j) > 2p(j+1)$.
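
The limiting relationship between $d$th exponential redundancy and maximal pointwise redundancy can be observed numerically on a small finite example. The Python sketch below uses hypothetical function names and an arbitrary three-symbol distribution chosen for illustration; the sum is evaluated relative to its largest exponent to avoid floating-point overflow at large $d$, which is an implementation choice rather than part of the definition.

```python
import math


def max_pointwise_redundancy(lengths, probs):
    """R*(N, P) = max_i [n(i) + log2 p(i)]."""
    return max(n + math.log2(p) for n, p in zip(lengths, probs))


def dth_exponential_redundancy(lengths, probs, d):
    """R_d(N, P) = (1/d) log2 sum_i p(i) 2^(d (n(i) + log2 p(i)))."""
    x = [n + math.log2(p) for n, p in zip(lengths, probs)]
    x_max = max(x)
    # Factor out 2^(d * x_max) so every remaining exponent is <= 0.
    s = sum(p * 2.0 ** (d * (xi - x_max)) for p, xi in zip(probs, x))
    return x_max + math.log2(s) / d


probs = [0.6, 0.3, 0.1]
lengths = [1, 2, 2]
print(max_pointwise_redundancy(lengths, probs))
for d in (1, 10, 200):
    print(d, dth_exponential_redundancy(lengths, probs, d))
```

The printed values increase with $d$ and approach the maximal pointwise redundancy from below.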